Homework 1

In this homework you will implement a numerical approach to gradient descent and use it to implement the perceptron algorithm. That will take place in stages described below. But we'll start by describing how to perform gradient descent numerically.

The method of finite differences

Given a function $f(x)$, the analytical approach to gradient descent (i.e., to finding the value of $x$ that minimizes $f(x)$) is to compute $f'(x)$ and iterate as follows, where $\alpha \in (0, 1]$ is the learning rate:

$x \leftarrow x - \alpha f'(x)$

If you cannot compute $f'(x)$ analytically, you can estimate it as follows, for sufficiently small $\epsilon$:

$f'(x) \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$

The method of finite differences assumes that over a very small interval the function behaves approximately linearly, so the slope of the tangent line can be estimated as the difference between the function values at the two ends of the interval divided by the interval width.

For example, the slope of $f(x) = x^2$ at $x = 1$ can be estimated as $f'(1) \approx \frac{(1 + 0.001)^2 - (1 - 0.001)^2}{2 \times 0.001} \approx 2.0$.

Note that the number above is very close to what you would get analytically by taking derivatives: $f'(x) = 2x$ so $f'(1) = 2*1 = 2$.

You'll use the method of finite differences to compute the derivative of a loss function with respect to the weights of a perceptron.

Below is a simple implementation of the method of finite differences for a univariate function. It is overly simple, running for a fixed number of iterations, and assuming constants for $\epsilon$ and $\alpha$, but it works for simple cases.
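As a point of reference, a minimal sketch along those lines might look like the following; the function name `gradient_descent_1d` and the particular constants used for $\epsilon$, $\alpha$, and the iteration count are illustrative assumptions rather than the assignment's exact code.

```python
# A minimal sketch (not necessarily the assignment's exact cell): gradient
# descent on a univariate function using the method of finite differences.
def gradient_descent_1d(f, x0, alpha=0.1, epsilon=1e-4, num_iters=100):
    x = x0
    for _ in range(num_iters):
        # Central-difference estimate of f'(x).
        grad = (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon)
        # Step downhill by a fraction alpha of the estimated slope.
        x = x - alpha * grad
    return x

# Example: minimizing f(x) = x^2 starting at x = 1 should end up near 0.
print(gradient_descent_1d(lambda x: x ** 2, x0=1.0))
```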

Perceptron class

Below is a simple Perceptron class that you can use in this assignment. Feel free to use it as is, make changes, or throw it out and write your own. It creates a weight vector with small random numbers when initialized, and has methods to compute the activation for an input and to produce a binary class label for an input.

Note that the class below does not maintain an explicit bias term $b$. You can add one or, better yet, make sure that all inputs, $x$, have a 1 in one of the positions.
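For concreteness, a minimal sketch of a class along those lines (assuming NumPy; the attribute and method names `weights`, `activation`, and `predict` are illustrative, not necessarily those of the provided class) might look like this:

```python
import numpy as np

# A sketch of the kind of Perceptron class described above: small random
# initial weights, a method for the activation (dot product of weights and
# input), and a method that turns the activation into a +1/-1 class label.
class Perceptron:
    def __init__(self, num_inputs):
        # Small random initial weights, one per input position (no separate
        # bias term; the inputs are expected to carry a constant 1).
        self.weights = np.random.uniform(-0.01, 0.01, size=num_inputs)

    def activation(self, x):
        return np.dot(self.weights, x)

    def predict(self, x):
        # Binary class label: +1 if the activation is non-negative, else -1.
        return 1 if self.activation(x) >= 0 else -1
```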

Task 1: The hinge loss

Fill in the function below to compute the hinge loss. Recall that the hinge loss is 0 if the activation and the class label have the same sign; otherwise it is -y * activation.
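One way to express that rule in code, as a sketch (the signature here, taking the activation and the label directly, is an assumption about how the missing function is set up):

```python
# Sketch of the loss described above: zero when the activation and the label
# y (+1 or -1) have the same sign, and -y * activation otherwise.
def hinge_loss(activation, y):
    if y * activation > 0:
        return 0.0          # same sign: correctly classified, no loss
    return -y * activation  # opposite signs: loss grows with the magnitude of the activation
```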

Task 2: One step of gradient descent

Fill in the function below, which takes the following arguments:

The function must return a new set of weights to use in the perceptron after performing one gradient descent update step using the training example and loss function. To do that, it should estimate the partial derivative of the loss with respect to each weight using the method of finite differences, and then subtract $\alpha$ times that estimate from the corresponding weight.

Note: Be careful not to modify the weights of the perceptron in place in the routine below.
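A sketch of such a step, reusing the names from the sketches above (the argument list here, `perceptron, x, y, loss, alpha, epsilon`, is an assumption, not the assignment's exact signature):

```python
import numpy as np

# Sketch of one gradient descent step using finite differences: perturb each
# weight by +/- epsilon, estimate the partial derivative of the loss, and move
# the weight a fraction alpha in the downhill direction.
def gradient_descent_step(perceptron, x, y, loss, alpha=0.1, epsilon=1e-4):
    # Work on a copy so the perceptron's weights are not modified in place.
    new_weights = perceptron.weights.copy()
    for i in range(len(new_weights)):
        w_plus = perceptron.weights.copy()
        w_plus[i] += epsilon
        w_minus = perceptron.weights.copy()
        w_minus[i] -= epsilon
        # Central-difference estimate of the partial derivative of the loss
        # with respect to weight i.
        grad_i = (loss(np.dot(w_plus, x), y) - loss(np.dot(w_minus, x), y)) / (2 * epsilon)
        new_weights[i] -= alpha * grad_i
    return new_weights
```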

Some training data

The code below generates a simple 2D dataset of n positive examples followed by n negative examples. The cell after that plots them. The code also prepends a 1 to each example so that the bias term will simply correspond to the first weight.
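A sketch of that kind of generator, assuming NumPy and two Gaussian blobs (the actual cell's distributions, centers, and spreads are unknown, so the choices below are placeholders):

```python
import numpy as np

# Sketch of a 2D dataset: n positive examples followed by n negative examples,
# each prepended with a constant 1 so the first weight plays the role of the bias.
def make_data(n, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n, 2))    # positive blob
    neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n, 2))  # negative blob
    X = np.vstack([pos, neg])
    X = np.hstack([np.ones((2 * n, 1)), X])   # prepend the constant-1 column
    y = np.array([1] * n + [-1] * n)          # +1 labels first, then -1
    return X, y
```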

Task 3: Full gradient descent

If you've done everything above correctly, the code below will perform gradient descent to train the classifier.

Modify this code so that it runs until one epoch produces no classification errors rather than running for a fixed number of iterations.
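One way to structure that stopping condition, sketched with the names used in the earlier sketches (so the details are assumptions, not the provided code):

```python
# Sketch of training until a full epoch produces no classification errors.
def train(perceptron, X, y, loss, alpha=0.1, epsilon=1e-4):
    while True:
        errors = 0
        for xi, yi in zip(X, y):
            if perceptron.predict(xi) != yi:
                errors += 1
            # One finite-difference gradient descent step on this example.
            perceptron.weights = gradient_descent_step(
                perceptron, xi, yi, loss, alpha, epsilon)
        if errors == 0:
            break   # an entire epoch with no mistakes: stop
    return perceptron
```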

Task 4: Plot some hyperplanes

Run full gradient descent 5 times and write a routine to convert the weights into slope/intercept form. Then use the function below to plot the hyperplanes learned by the perceptron along with the data in one graph. The second cell below shows how that can be done. Write a paragraph explaining what you see in the plot, touching on how much variation there is from run to run and whether the separators seem like "good" ones.
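For the conversion, note that with the leading-1 convention the decision boundary is $w_0 + w_1 x_1 + w_2 x_2 = 0$, which gives slope $-w_1/w_2$ and intercept $-w_0/w_2$. A sketch of such a routine (the name is illustrative):

```python
# Sketch: convert a weight vector [w0, w1, w2] (bias weight first, matching the
# data above) into slope/intercept form for the line w0 + w1*x1 + w2*x2 = 0.
def to_slope_intercept(weights):
    w0, w1, w2 = weights
    # Assumes w2 != 0, i.e., the separator is not a vertical line.
    return -w1 / w2, -w0 / w2
```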

Task 5: Do task 4 again with a new loss

Repeat the previous task, plotting the 5 separating hyperplanes along with the data, but this time use the loss function below.

How are the results different? Write a brief paragraph explaining why the results look different.